Focussed crawling of environmental web resources: A pilot study on the combination of multimedia evidence
Authors
Abstract
This work investigates the use of focussed crawling techniques for the discovery of environmental multimedia Web resources that provide air quality measurements and forecasts. Focussed crawlers automatically navigate the hyperlinked structure of the Web and select the hyperlinks to follow by estimating their relevance to a given topic, based on evidence obtained from the already downloaded pages. Given that air quality measurements and particularly air quality forecasts are presented not only in textual form, but are most commonly encoded as multimedia, mainly in the form of heatmaps, we propose the combination of textual and visual evidence for predicting the benefit of fetching an unvisited Web resource. First, text classification is applied to select the relevant hyperlinks based on their anchor text, a surrounding text window, and URL terms. Further hyperlinks are selected by combining their text classification score with an image classification score that indicates the presence of heatmaps in their source page. A pilot evaluation indicates that the combination of textual and visual evidence results in improvements in the crawling precision over the use of textual features alone.
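The two-stage link selection described in the abstract can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the `Link` class, the `select_links` function, the thresholds, and the weighted-sum combination rule are all hypothetical, and the two scores stand in for the outputs of trained text and image classifiers that the abstract does not specify.

```python
from dataclasses import dataclass


@dataclass
class Link:
    """A hyperlink found on an already downloaded page (hypothetical structure)."""
    url: str
    text_score: float     # assumed text-classifier confidence (anchor text, text window, URL terms)
    heatmap_score: float  # assumed image-classifier confidence that the source page shows a heatmap


def select_links(links, text_threshold=0.5, combined_threshold=0.5, weight=0.5):
    """Return the URLs to follow, in input order.

    Stage 1: keep links whose text score alone passes text_threshold.
    Stage 2: for the rest, combine text and heatmap scores with a weighted
    sum (an assumed combination rule) and keep those passing combined_threshold.
    """
    selected = []
    for link in links:
        if link.text_score >= text_threshold:
            selected.append(link.url)
            continue
        combined = weight * link.text_score + (1 - weight) * link.heatmap_score
        if combined >= combined_threshold:
            selected.append(link.url)
    return selected
```

In this sketch, a link with weak textual evidence can still be followed when its source page shows strong visual evidence of a heatmap, which is the intuition behind combining the two kinds of evidence.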
Similar resources
Georeferencing Semi-Structured Place-Based Web Resources Using Machine Learning
In recent years, content shared on the web has grown significantly. A large part of this information is publicly available in the form of semi-structured data. Moreover, a significant amount of this information is related to place. Such information refers to a location on the earth but does not contain any explicit coordinates. In this research, we tried to georefer...
Prioritize the ordering of URL queue in Focused crawler
The enormous growth of the World Wide Web in recent years has made it necessary to perform resource discovery efficiently. For a crawler, it is not a simple task to download domain-specific web pages. This unfocused approach often shows undesired results. Therefore, several new ideas have been proposed; among them, a key technique is focused crawling, which is able to crawl particular topical...
Hybrid focused crawling on the Surface and the Dark Web
Focused crawlers enable the automatic discovery of Web resources about a given topic by automatically navigating through the Web link structure and selecting the hyperlinks to follow by estimating their relevance to the topic of interest. This work proposes a generic focused crawling framework for discovering resources on any given topic that reside on the Surface or the Dark Web. The proposed ...
Anomaly detection on the Web through creating access usage profiles
Due to the increase in cyber-attacks, the need for attack detection techniques for web servers has drawn attention. Unfortunately, many available security solutions are inefficient in identifying web-based attacks. The main aim of this study is to detect abnormal web navigation based on web usage profiles. In this paper, comparing the scrolling behavior of a normal user with that of an attacker, and simu...
Eye-Tracking Method’ Usage for Understanding the Cognitive Processes in Multimedia Learning
Introduction: The design of multimedia learning environments should be grounded in evidence-based studies and principles of the human learning process. Eye tracking is a method based on how learners process the learning materials presented in multimedia learning environments. The aim of the study was to examine the use of the eye-tracking method to investigate the cognitive processes in m...